DATASET OVERVIEW

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

The white wine dataset contains 4898 observations of 13 variables. Column ‘X’ merely replicates the observation order and carries no useful information for my analysis, so it will be dropped. The response variable ‘quality’ is stored as an integer, but I would like to convert it into an ordered factor, since it represents ordered ratings.

DATA PREPROCESSING

The number of observations remained the same (4898) after preprocessing, so the dataset contains no missing observations.

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
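The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the data frame `wq_toy` and its values are hypothetical stand-ins, and the use of `na.omit()` for the missing-value check is an assumption.

```r
# Hypothetical toy rows standing in for the white wine data;
# only the column names X, alcohol and quality come from the output above.
wq_toy <- data.frame(
  X       = 1:5,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0),
  quality = c(6L, 5L, 7L, 6L, 8L)
)

# Drop the row-index column, which carries no information.
wq_toy$X <- NULL

# Remove rows with missing values and record the row count before and after.
n_before <- nrow(wq_toy)
wq_toy   <- na.omit(wq_toy)
n_after  <- nrow(wq_toy)

# Convert the integer rating to an ordered factor with levels 3 < 4 < ... < 9.
wq_toy$quality <- factor(wq_toy$quality, levels = 3:9, ordered = TRUE)

str(wq_toy)
```

If `n_before` equals `n_after`, no rows contained missing values, which is the check the text relies on.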

UNIVARIATE PLOTS SECTION

In this section, I first summarize the variables in my dataset.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##  quality 
##  3:  20  
##  4: 163  
##  5:1457  
##  6:2198  
##  7: 880  
##  8: 175  
##  9:   5

The response variable quality is scored on a scale from 0 to 10, but the ratings in this dataset only span 3 to 9. The most common rating is 6. The extreme ratings 3 and 9, which correspond to poor and excellent quality, are rare (20 and 5 wines respectively). The spread across ratings makes the variable worth exploring further.

There is something unusual about the residual sugar variable. Its maximum value (65.8) deviates greatly from its median (5.2) and mean (6.391). The variable guidelines state that wines with residual sugar values greater than 45 are considered sweet. I would like to check for such wines below.

Rest of the variable distribution looks normal.

## 
## FALSE  TRUE 
##  4897     1
## 
## FALSE  TRUE 
##  4879    19
## 
## FALSE  TRUE 
##  2700  2198
## 
## FALSE  TRUE 
##  3441  1457

There is a single observation which is considered sweet white wine based on the guidelines.
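The FALSE/TRUE tables above look like the result of calling `table()` on logical conditions. The sketch below shows that pattern on the first ten values of two columns, copied from the `str()` output earlier in this report. The exact conditions used in the original analysis are not shown, so `residual_sugar > 45` and `citric_acid == 0` are assumptions based on the surrounding text.

```r
# First ten values of each column, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
citric_acid    <- c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)

# table() on a logical vector counts the FALSE and TRUE cases.
sweet_counts     <- table(residual_sugar > 45)  # wines above the "sweet" threshold
no_citric_counts <- table(citric_acid == 0)     # wines with no citric acid

print(sweet_counts)
print(no_citric_counts)
```

None of these ten rows crosses either threshold, so each table has only a FALSE entry. On the full data the same calls would give the FALSE/TRUE split.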

UNIVARIATE PLOTS

I would like to graphically examine the distribution of individual variables using histograms.

Fixed acidity values around 7 have the highest count. There are outliers at both the low and high extremes, each with very low counts.

Volatile acidity values between 0.2 and 0.4 account for the majority of counts. There are outliers at the high end with very low counts.

Citric acid values between 0.2 and 0.4 have the highest counts. I observe a sharp peak with more than 200 counts at 0.5. There are outliers at both the low and high extremes with very low counts.

Residual sugar follows a right-skewed distribution, with a sharp peak at low values. This needs attention.

Free sulfur dioxide follows a roughly normal distribution with a few outliers at its high end. The maximum count occurs in the midrange, around 50.

Total sulfur dioxide has a dense region between 100 and 200. There are outliers at both the low and high extremes with very low counts.

Density follows a roughly normal distribution. The values concentrate in a narrow band between about 0.99 and 1.00 (median 0.9937), and there are outliers at the high end.

The pH variable shows the closest fit to a normal distribution so far, and the majority of wine samples have a pH between 3.1 and 3.3. There are outlier samples with pH values greater than 3.6 and less than 2.8.

The highest count for sulphates is observed between 0.4 and 0.5. A few samples have high sulphate values.

The alcohol distribution is somewhat unusual: it is skewed to the right, with counts decreasing as alcohol increases. The majority of wine samples have an alcohol content between 9 and 11.

Most variables follow an approximately normal distribution, except residual sugar and alcohol, which are positively skewed. In the next step I adjust the bin widths for a more accurate view and apply a log transformation to residual sugar to reduce the skew.

The histograms above show approximately normal distributions for most of the variables except residual sugar, which has a long-tailed distribution. I will use a log transformation to obtain a better representation of this variable. For some variables the bulk of the bell curve also sits towards the left, which may be due to outliers at the high end. I would like to investigate this in future analysis.

After the log transformation, residual sugar no longer shows a long tail. Instead, both plots reveal a bimodal distribution with two distinct peaks.
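The effect of a log transformation on a long-tailed variable can be sketched as below. The data are synthetic log-normal draws, not the wine data, and the helper `skewness()` is defined here only for illustration. A single log-normal sample cannot reproduce the bimodality seen in residual sugar; the sketch only shows how the long right tail is compressed.

```r
set.seed(42)

# Synthetic right-skewed values; NOT the wine data.
x <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: the third standardized moment (illustrative helper).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

# Histograms on the raw and log10 scales (the bin count is an arbitrary choice).
hist(x, breaks = 40, main = "Raw scale")
hist(log10(x), breaks = 40, main = "log10 scale")
```

Comparing `skew_raw` with `skew_log` gives a numeric counterpart to what the two histograms show visually.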

Distribution appears better for most of the variables after removing the outliers. Response variable distribution is analyzed in the following section.

We can see from the histogram that most wines are rated 6 and that there are few wines of excellent or poor quality. The distribution appears roughly normal.

CREATE NEW VARIABLE

##  free_total_sulfur.dioxide       SA        
##  Min.   :0.02362           Min.   :0.0566  
##  1st Qu.:0.19093           1st Qu.:0.1575  
##  Median :0.25368           Median :0.4906  
##  Mean   :0.25558           Mean   :0.6423  
##  3rd Qu.:0.31579           3rd Qu.:0.9773  
##  Max.   :0.71053           Max.   :5.6239
## 'data.frame':    4898 obs. of  2 variables:
##  $ free_total_sulfur.dioxide: num  0.265 0.106 0.309 0.253 0.253 ...
##  $ SA                       : num  2.352 0.168 0.683 0.859 0.859 ...
##  quality 
##  3:  20  
##  4: 163  
##  5:1457  
##  6:2198  
##  7: 880  
##  8: 175  
##  9:   5

The SA variable shows an unusually large maximum value that deviates greatly from its mean. This may be due to a sweet wine with a high residual sugar level after fermentation.

The free to total sulfur dioxide ratio looks normally distributed. The sugar to alcohol ratio shows bimodal behaviour, as already seen in the individual plots.
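The two derived variables can be reproduced from the first rows of the dataset. In the sketch below the input values are copied from the `head()` output at the top of this report. The definitions (free divided by total sulfur dioxide, and residual sugar divided by alcohol) are inferred from the text, and the data frame name `wq_head` is hypothetical.

```r
# First two rows of the relevant columns, copied from the head() output above.
wq_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14),
  total.sulfur.dioxide = c(170, 132),
  residual.sugar       = c(20.7, 1.6),
  alcohol              = c(8.8, 9.5)
)

# Ratio of free to total sulfur dioxide.
wq_head$free_total_sulfur.dioxide <-
  wq_head$free.sulfur.dioxide / wq_head$total.sulfur.dioxide

# Sugar-to-alcohol ratio.
wq_head$SA <- wq_head$residual.sugar / wq_head$alcohol

new_vars <- round(wq_head[, c("free_total_sulfur.dioxide", "SA")], 3)
print(new_vars)
```

The rounded results (0.265 and 0.106 for the sulfur dioxide ratio, 2.352 and 0.168 for SA) agree with the first values in the `str()` output above, which supports the inferred definitions.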

UNIVARIATE ANALYSIS

WHAT IS THE STRUCTURE OF YOUR DATASET?

The original dataset comprised 4898 observations of 13 variables, 12 of which remain after dropping the index column X. The 11 input attributes are all quantitative, so none of them are factor variables. The response variable quality is an ordered factor, since it consists of ratings. Quality ranges from 3 to 9 in this dataset, although the rating scale itself runs from 0 to 10. The median rating is 6, so most of the wines are considered to be of fairly good quality.

Residual sugar: There is one wine whose residual sugar value is 65.8 and, according to the description of attributes, this wine is considered sweet. Residual sugar follows a bimodal distribution, which reveals one group of wines with low sugar content and another with high sugar content.

Citric acid: 19 wine samples have no citric acid present to add to their freshness.

Quality: Almost 75% of wine samples have ratings 5 and 6, with majority of the rating being 6.
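The share of wines rated 5 or 6 and the median rating can be checked directly from the quality counts printed in the summary above. This is a small arithmetic sketch and the variable names are illustrative.

```r
# Counts per quality rating, copied from the summary output above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

total <- sum(quality_counts)

# Share of wines rated 5 or 6.
share_5_6 <- sum(quality_counts[c("5", "6")]) / total

# Median rating, obtained by expanding the counts back into individual ratings.
median_rating <- median(rep(as.integer(names(quality_counts)), quality_counts))

round(share_5_6, 3)
median_rating
```

The counts add up to the 4898 observations, and the share comes out at roughly 0.746, in line with the "almost 75%" stated above.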

What is/are the main feature(s) of interest in your dataset?

Quality, the response variable, is of main interest, and I would like to determine the effect of the significant variables on it. Alcohol is another feature of interest, whose levels might impact the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I would like to investigate the new variable SA which is sugar to alcohol ratio as it showed bimodal behaviour. Residual sugar will also be scrutinized in future analysis.

Did you create any new variables from existing variables in the dataset?

While exploring the dataset, I created two new variables: the ratio of free to total sulfur dioxide, and the SA variable, which captures the sugar to alcohol ratio that might affect the quality of wine.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The SA variable seems unusual, as its maximum value deviates greatly from its mean. The same holds for residual sugar.

I changed the quality variable to an ordered factor for easier analysis, as quality consists of ratings observed from 3 to 9.

I applied a log transformation to residual sugar hoping to obtain a normal distribution, but the result was a bimodal distribution, which suggests two groups of wines, one with low and one with high sugar content. Further investigation will be carried out for any unusual behaviour observed so far.

BIVARIATE PLOTS

The chart is difficult to read as it includes a large number of feature pairs. I will include a PNG image of it in my project for better visualization and interpretation.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality_new
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
## 
##  Pearson's product-moment correlation
## 
## data:  quality_new and density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
## 
##  Pearson's product-moment correlation
## 
## data:  quality_new and chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344

I observe moderate to weak correlations between these variables and quality: alcohol (r = 0.44), density (r = -0.31) and chlorides (r = -0.21). Examining these pairs via plots should help explain the correlations.

Next, I explore bivariate plots of quality against the rest of the features via box plots.
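The t statistics reported by the correlation tests follow from the correlation coefficient and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes them from the values printed in the output above, as a consistency check rather than a rerun of the tests.

```r
# Correlation estimates copied from the cor.test() output above.
r <- c(alcohol = 0.4355747, density = -0.3071233, chlorides = -0.2099344)
df <- 4896  # n - 2 for n = 4898 observations

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

round(t_stat, 3)
```

The recomputed values agree with the reported t = 33.858, -22.581 and -15.024 up to rounding.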

The boxplot reveals significant information about the relationship between alcohol and quality. The median alcohol level is lower for the lower quality ratings, and it rises markedly as the quality rating increases from 6 to 9. The largest share of wine samples in the dataset has a quality rating of 6.

Chlorides show many outliers, particularly at quality ratings 5 and 6, where quite a number of samples have chloride levels far from the mean. An outlier investigation seems necessary.

Density level shows a little downward tendency with an increase in quality rating.

I would like to create scatterplots and boxplots to investigate the relationships between the variables of interest further. I add clarity to the plots through the following actions: 1) adding jitter for better visualization, 2) limiting the axes to remove the outliers observed, and 3) changing transparency to prevent overplotting.
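The three adjustments listed above can be sketched with base R graphics. The data here are synthetic, not the wine data, and base graphics are used only to keep the sketch free of package dependencies. In ggplot2 the same steps would map to `geom_jitter()`, the `alpha` argument and `coord_cartesian()`. The 1% and 99% quantile limits are an arbitrary choice for illustration.

```r
set.seed(1)

# Synthetic stand-in data; NOT the wine data.
quality <- sample(3:9, 500, replace = TRUE)
alcohol <- rnorm(500, mean = 10.5, sd = 1.2)

# 2) Limit the x axis to the central 98% of values to hide extreme outliers.
x_limits <- quantile(alcohol, probs = c(0.01, 0.99))

plot(
  alcohol,
  jitter(quality, amount = 0.3),                   # 1) jitter the discrete ratings
  xlim = x_limits,
  col  = adjustcolor("steelblue", alpha.f = 0.3),  # 3) transparency against overplotting
  pch  = 16,
  xlab = "alcohol",
  ylab = "quality (jittered)"
)
```

Limiting the axis only hides points from view. It does not remove them from the data, so any statistics computed afterwards still include the outliers.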

The jittered plot of quality against alcohol is clearer and suggests that the correlation between them is meaningful. We can see a slight upward shift from left to right, and the same is revealed in the box plot. As alcohol content increases from 11 to 13, quality ratings of 8 and 9 become more common. The judges seem to prefer wines with a higher alcohol content.

The plot of quality against density agrees with the negative correlation noted previously. As the quality level increases, the range of density values tends to decrease, and there is a downward shift from left to right. This confirms that wines with high quality ratings have lower density values.

The plot of quality against chlorides shows a weak negative correlation between them. There is a dense cluster of points with chloride values between 0.02 and 0.06. Wines with chloride values greater than 0.08 mostly have quality ratings between 4 and 6.

I find the relation between quality and alcohol to be more significant and would like to draw further inference on them.

The histogram confirms the inference noted previously that as alcohol content increases, the quality of the wine increases too. This is visible in the histogram, where quality ratings of 7 and higher are more frequent at higher alcohol contents. I want to explore the summary of alcohol after grouping by quality.

## # A tibble: 7 x 5
##   quality mean_alcohol median_alcohol min_alcohol max_alcohol
##   <ord>          <dbl>          <dbl>       <dbl>       <dbl>
## 1 3              10.3           10.4         8.00        12.6
## 2 4              10.2           10.1         8.40        13.5
## 3 5               9.81           9.50        8.00        13.6
## 4 6              10.6           10.5         8.50        14.0
## 5 7              11.4           11.4         8.60        14.2
## 6 8              11.6           12.0         8.50        14.0
## 7 9              12.2           12.5        10.4         12.9

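This summary provides more insight into the alcohol content for each quality rating: the median falls from 10.4 at rating 3 to 9.5 at rating 5 and then rises steadily to 12.5 at rating 9.

The histogram reveals that higher quality ratings are observed at lower density levels, below about 0.992.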
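A grouped summary of this kind can be sketched as follows. The tibble output above suggests dplyr was used in the original analysis. This sketch swaps in base R's `aggregate()` to stay dependency-free, and it runs on hypothetical toy values, not the wine data.

```r
# Hypothetical toy data; only the column names come from the analysis.
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 7, 7, 7), levels = 3:9, ordered = TRUE),
  alcohol = c(9.2, 9.5, 10.1, 10.4, 10.8, 11.2, 11.6, 12.4)
)

# One row per quality level that occurs, with four summary statistics.
alcohol_by_quality <- aggregate(
  alcohol ~ quality,
  data = toy,
  FUN  = function(v) c(mean = mean(v), median = median(v),
                       min = min(v), max = max(v))
)

print(alcohol_by_quality)
```

Quality levels with no observations are dropped from the result. The statistics are stored as a matrix column, so a single statistic is accessed as `alcohol_by_quality$alcohol[, "median"]`.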

## # A tibble: 7 x 5
##   quality mean_density median_density min_density max_density
##   <ord>          <dbl>          <dbl>       <dbl>       <dbl>
## 1 3              0.995          0.994       0.991       1.00 
## 2 4              0.994          0.994       0.989       1.00 
## 3 5              0.995          0.995       0.987       1.00 
## 4 6              0.994          0.994       0.988       1.04 
## 5 7              0.992          0.992       0.987       1.00 
## 6 8              0.992          0.992       0.987       1.00 
## 7 9              0.991          0.990       0.990       0.997

The highest quality ratings have the lowest density, as noted before.

Alcohol is of utmost interest, so finally I would like to build a linear model of quality on alcohol and examine its predictions.

## 
## Call:
## lm(formula = I(quality_new) ~ I(alcohol), data = WQ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.582009   0.098008   5.938 3.08e-09 ***
## I(alcohol)  0.313469   0.009258  33.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

The scatterplot does not show much linearity in this relationship; the points are far too scattered. The summary of the linear model shows a low R-squared value (0.19), which means that the predictive capability of the model is poor despite the moderate correlation between these two variables. The residuals, however, are roughly symmetric around zero. To increase the predictive power of the model, a non-linear fit should be considered, either by introducing new variables or by adding higher-order terms to the model.

For the rest of the bivariate analysis, I focus on correlations among the remaining variables, setting aside quality, which is my main feature of interest. The correlations observed are as follows.
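Adding a higher-order term, as suggested above, can be sketched like this. The data are synthetic with a deliberately curved relationship and are not the wine data. The sketch only illustrates how a linear and a quadratic fit can be compared, and makes no claim about what a quadratic term would do for the wine model.

```r
set.seed(7)

# Synthetic data with a curved relationship; NOT the wine data.
x <- runif(300, min = 8, max = 14)
y <- 6 - 0.4 * (x - 12)^2 + rnorm(300, sd = 0.8)

fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ poly(x, 2))  # adds a second-order term

r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared

# F-test for whether the quadratic term improves the fit.
anova(fit_linear, fit_quadratic)
```

Because the models are nested, R-squared can never decrease when the extra term is added. The F-test from `anova()` is the more informative check of whether the improvement is worth the added complexity.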

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312
## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Density, alcohol and residual sugar are strongly correlated with one another. I will scrutinize their relationships visually.

As is evident from the above graphs, density and residual sugar are positively correlated: wines with more residual sugar have higher densities. Density and alcohol are strongly negatively correlated, meaning that the higher the alcohol content, the less dense the wine. The residual sugar and alcohol graph gives a weaker signal: wines with lower alcohol levels tend to have higher sugar content and vice versa.

BIVARIATE ANALYSIS

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I investigated quality, my main feature of interest, against the variables most correlated with it: alcohol, density and chlorides. I found an interesting relationship between quality and alcohol: as alcohol content increased from 11 to 13, wines rated 8 and 9 became more common. I believe the relationship is not strictly monotonic, as too high an alcohol content might deteriorate the quality of wine by making it taste too strong. The linear model pointed the same way: the widely scattered points suggest a non-linear relationship. Thus the right amount of alcohol is a crucial factor for the quality of wine.

Density and alcohol were negatively correlated (r = -0.78): as alcohol content increased, the density of the wine decreased. Since quality is also negatively correlated with density (r = -0.31), a highly rated wine tends to be less dense.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, there was a strong positive relationship between density and residual sugar: wines with higher sugar content have higher densities. Alcohol and density are negatively correlated, meaning that the higher the alcohol content, the less dense the wine. There was a moderate negative relationship between alcohol and sugar (r = -0.45): wines with lower alcohol levels tend to have higher sugar content. Thus, in obtaining a good quality wine, suitable quantities of these variables might play a significant role.

What was the strongest relationship you found?

As discussed in previous sections, the strongest relationship was between residual sugar and density (r = 0.84), followed closely by alcohol and density (r = -0.78).

MULTIVARIATE PLOTS

As observed, residual sugar showed a bimodal distribution. With this in mind, I would like to convert this variable and other variables of interest into categorical variables for more meaningful interpretation, using the cut function as learnt in the course.

## 'data.frame':    4898 obs. of  19 variables:
##  $ fixed.acidity            : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity         : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid              : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar           : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides                : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide      : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide     : num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density                  : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                       : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates                : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol                  : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality                  : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ free_total_sulfur.dioxide: num  0.265 0.106 0.309 0.253 0.253 ...
##  $ SA                       : num  2.352 0.168 0.683 0.859 0.859 ...
##  $ residual.sugar.cat       : Factor w/ 4 levels "(0.6,1.2]","(1.2,5.2]",..: 4 2 3 3 3 3 3 4 2 2 ...
##  $ density.cat              : Factor w/ 5 levels "(0.94,0.987]",..: 5 3 4 4 4 4 4 5 3 3 ...
##  $ chlorides.cat            : Factor w/ 4 levels "(0.008,0.036]",..: 3 3 3 4 4 3 3 3 3 3 ...
##  $ alcohol.cat              : Factor w/ 4 levels "(8,9.5]","(9.5,10.4]",..: 1 1 2 2 2 2 2 1 1 3 ...
##  $ quality.cat              : Factor w/ 3 levels "(2,5]","(5,7]",..: 1 1 1 1 1 1 1 1 1 1 ...

The categories of the above variables follow the quartile breaks obtained in the summary. I would like to interpret the quality variable according to its split as follows: ratings of 5 and below are considered poor, ratings of 6 and 7 intermediate, and ratings above 7 good quality wine. The variables of interest discussed before will be analyzed in a single plot.
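One detail is worth checking when binning an ordered factor with `cut()`. In the `str()` output above, the first rows have quality "6" but fall into the quality.cat bin "(2,5]". That is what one would see if the cut were applied to the factor's integer codes instead of the ratings themselves. The original code is not shown, so this is only a plausible explanation. The sketch below illustrates the pitfall on hypothetical values. The upper break of 10 and the labels are assumptions.

```r
# Ordered factor of ratings, as produced by the preprocessing step.
quality <- factor(c(3, 5, 6, 6, 7, 8, 9), levels = 3:9, ordered = TRUE)

# Pitfall: as.numeric() on a factor returns the level codes 1..7, not the ratings.
codes <- as.numeric(quality)

# Going through as.character() recovers the actual ratings 3..9.
ratings <- as.numeric(as.character(quality))

# Bin the ratings; the upper break of 10 and the labels are assumptions.
quality_cat <- cut(ratings, breaks = c(2, 5, 7, 10),
                   labels = c("poor", "intermediate", "good"))

data.frame(quality, codes, ratings, quality_cat)
```

With the default `right = TRUE`, each interval includes its upper bound, so a rating of 5 lands in "poor" and a rating of 7 in "intermediate".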

The plot reveals that at lower levels of alcohol, density is higher and vice versa. Higher density levels go with high residual sugar content, visible through the purple dots. Higher alcohol levels are associated with lower residual sugar content, visible through the dense green points.

Next, I analyze the variables that had moderate correlation with quality in a single plot.

This single visualization is not very clear to interpret. One observation can be made about chlorides: higher chloride content is seen at lower levels of alcohol, with a dense scattering at quality levels between 5 and 7. Lower chloride content occurs at higher alcohol levels, with quality levels between 6 and 8.

I would like to evaluate density and chlorides features against quality for different levels of alcohol to further clarify the inference obtained by boxplots.

The first chart shows that median density decreases as alcohol content increases across all quality levels. This has been consistent throughout my analysis. The second chart shows that the median chlorides value increases for the lowest alcohol range as the quality rating increases from 6 to 8, whereas for the other alcohol ranges it decreases over the same ratings. In general, high chloride content is found in the lower alcohol ranges.

I would like to investigate the remaining features further before I begin to create a model for quality.

The following inferences can be drawn from the facet plots above. Good quality wines tend to have:
1) Residual sugar content below 20, and sulphates values concentrated in the range 0.3 to 0.7.
2) Citric acid values between 0.3 and 0.5, where the points are densest, and pH values between 3 and 3.5.
3) Free and total sulfur dioxide values that are sparse and less concentrated; these variables are denser in the poor and medium quality ranges.
4) Fixed acidity values between 5 and 8 and volatile acidity values between 0.2 and 0.5.

I noted some characteristic ranges of features for good quality wines. These features may have little impact on quality, as they have weak or partial correlations with it, but they cannot be neglected either. A good quality wine requires close attention to even minor features.

MODEL BUILDING: I will build a linear regression model by adding the features of interest in order of priority and see how well the model performs.

## 
## Call:
## lm(formula = quality_new ~ alcohol + density + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + volatile.acidity + fixed.acidity + 
##     pH + citric.acid + sulphates + residual.sugar, data = WQ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.482e+02  1.880e+01   7.881 3.98e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16
##                              2.5 %        97.5 %
## (Intercept)           1.113373e+02  1.850484e+02
## alcohol               1.460027e-01  2.409487e-01
## density              -1.876695e+02 -1.128988e+02
## chlorides            -1.318480e+00  8.239266e-01
## free.sulfur.dioxide   2.078263e-03  5.387267e-03
## total.sulfur.dioxide -1.026733e-03  4.552383e-04
## volatile.acidity     -2.086208e+00 -1.640146e+00
## fixed.acidity         2.460834e-02  1.064316e-01
## pH                    4.798045e-01  8.928830e-01
## citric.acid          -1.656148e-01  2.097952e-01
## sulphates             4.347243e-01  8.282287e-01
## residual.sugar        6.672953e-02  9.623608e-02

Inspecting the coefficients, it is evident that the variables of interest, namely residual sugar, alcohol, density and volatile acidity, have the largest absolute t-values. 8 of the 11 predictors (73%) are significant at the 5% level. Apart from the variables of interest that I noticed through the various plots, the remaining significant predictors should not be ignored when predicting the quality of white wine, as they do affect quality in some way. None of the combinations yielded a good predictive model of white wine quality. The R-squared value is low (0.28), which shows the poor predictive capability of this linear model. The model might perform better with non-linear terms or additional features.
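The count of significant predictors can be verified from the p-values in the coefficient table. In the sketch below the values are copied from the model summary above. Entries printed as "< 2e-16" are entered as 2e-16, which is an upper bound and does not affect the comparison against 0.05.

```r
# p-values copied from the coefficient table above (intercept excluded).
p_values <- c(
  alcohol = 1.70e-15, density = 4.04e-15, chlorides = 0.65097,
  free.sulfur.dioxide = 9.99e-06, total.sulfur.dioxide = 0.44979,
  volatile.acidity = 2e-16, fixed.acidity = 0.00171, pH = 8.10e-11,
  citric.acid = 0.81759, sulphates = 3.44e-10, residual.sugar = 2e-16
)

significant   <- p_values < 0.05
n_significant <- sum(significant)
share         <- n_significant / length(p_values)

n_significant
round(100 * share)
names(p_values)[!significant]  # predictors that are not significant
```

This gives 8 of 11 predictors, or about 73%. The three that are not significant are chlorides, total sulfur dioxide and citric acid.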

MULTIVARIATE ANALYSIS

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

This multivariate analysis helped me clarify the insights gained in the bivariate analysis section. A clear conclusion can be drawn about the relationship between density, alcohol and residual sugar, which I explored in a single plot. The analysis of medians strengthened the inference that the relationship between density and alcohol is consistent across all quality levels, whereas the relationship between chlorides and alcohol may change between quality levels. The variables that had weak correlations with quality were also analyzed in this section and provided useful insights.

Were there any interesting or surprising interactions between features?

I found the relationship between alcohol and quality interesting, as there was a positive correlation between alcohol and quality ratings throughout my analysis. My theory, however, is that there will be a threshold for alcohol beyond which a further increase deteriorates the quality of wine, as it will turn out too strong. Thus the right amount of alcohol is a crucial factor in producing good quality wine and is one of the most important features I found in my analysis.

Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built a linear regression model to predict the quality of white wine from the other features in the data. The model confirmed the significant variables that I found in my previous analysis; these variables of interest had high t-values. About 73% of the predictors were significant, which revealed that variables other than the ones I highlighted also contribute towards predicting the quality of white wine and should not be neglected. However, the model produced a low R-squared value, which signifies poor predictive capability. The reason could be that additional variables, crucial enough to improve predictive power, are missing from the data, or that I should fit a more flexible non-linear model.

FINAL PLOTS AND SUMMARY

PLOT 1

Description one

One of the reasons I chose this plot is that I consider alcohol a significant variable in determining the quality of white wine. As discussed before, these two variables are positively correlated. It is evident from both plots that as alcohol content increases, there is initially a decrease in quality ratings, but I see a sudden shift in quality ratings as alcohol content started to increase from 6 up to 9. This indicates the permissible alcohol content that is necessary in making wine. Also, according to my theory, if alcohol content increases further beyond 9% by volume, there is a chance that the quality of the wine will deteriorate rapidly, as the wine will taste very strong. This gives me a strong basis for treating alcohol as a key variable for the quality of the wine produced.

PLOT 2

Description two

I chose this plot because density was my second-priority variable of interest. Density is correlated with quality, alcohol, and residual sugar, and plays a vital role in building suitable inferences in my analysis. The histogram reveals that higher-quality wine has lower density, preferably in and around 0.992 g/cm^3. The portion of the histogram left of center, with lower density levels, shows the colors of the higher quality ratings.

PLOT 3

Description three

This visualization shows the relationship between the variables of interest in a single plot. I converted residual sugar into a categorical variable so that I could extract more insights at different sugar levels. As evident from the graph, residual sugar levels shift downward as we move from the highest density to the lowest density. The purple dots, with residual sugar between 9.9 and 65.8, appear only at higher density levels, while the red dots, with lower sugar content between 0.6 and 1.2, appear at lower density levels. As the alcohol level rises, density decreases, which is evident from the downward trend to the right in the scatterplot. Thus alcohol makes wine less dense and sugar makes it more dense. The purple points, which have the highest sugar content, are concentrated in the left portion of the graph, which reveals that wines with lower alcohol content have higher sugar levels. This goes well with my alcohol theory, as higher alcohol content will make wine strong and bitter. This plot reveals valuable insights for my whole wine analysis.
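Converting a numeric variable into categories and coloring a scatterplot by it can be sketched in base R as below. The data are simulated, and the choice of five quantile-based buckets and the color order are assumptions for illustration; the buckets used in the plot above may have been defined differently.

```r
# Synthetic data (assumption): sugar raises density, alcohol lowers it.
set.seed(3)
n       <- 400
sugar   <- rexp(n, rate = 1 / 6) + 0.6
alcohol <- runif(n, 8, 14)
density <- 0.990 + 0.0004 * sugar - 0.001 * (alcohol - 8) +
  rnorm(n, sd = 0.0005)

# Cut residual sugar into five buckets at its quintiles.
breaks       <- quantile(sugar, probs = seq(0, 1, by = 0.2))
sugar_bucket <- cut(sugar, breaks = breaks, include.lowest = TRUE)
table(sugar_bucket)

# Median density per bucket: higher sugar buckets sit at higher density.
med_density <- tapply(density, sugar_bucket, median)

# Scatterplot of density against alcohol, colored by sugar bucket.
cols <- c("red", "orange", "green", "blue", "purple")
plot(alcohol, density, pch = 16, col = cols[as.integer(sugar_bucket)],
     xlab = "alcohol (% by volume)", ylab = "density (g/cm^3)")
legend("topright", legend = levels(sugar_bucket), col = cols, pch = 16,
       title = "residual sugar")
```

Quantile breaks give each bucket roughly the same number of wines, which keeps the colors balanced even when sugar is strongly right-skewed.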

REFLECTION

This project made me proficient in using R for exploratory data analysis. It started with basic functions like dim, str, and head to understand the structure and nature of the dataset I was going to explore. This basic information carried over into the data preprocessing stage, where I refined the dataset by omitting NA observations and subsetting the data to eliminate unwanted variables. I also applied more advanced functions like summary, group_by, ggpairs, and quantile ranges. I feel confident creating plots such as scatterplots, boxplots, and histograms. I improved the clarity of the plots through transparency, limiting the axes, jitter, and facet wrapping. I learnt a great deal about modelling in R. Finally, I now know how to build and craft a project in R through R Markdown.
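The inspection and preprocessing steps mentioned above can be sketched on a toy data frame. The values, the inserted NAs, and the choice of `X` (the row-index column) as the unwanted variable are assumptions for illustration, not the actual preprocessing code of this project.

```r
# Toy data frame (assumption): a row index X plus two features with gaps.
wine_toy <- data.frame(
  X       = 1:6,
  alcohol = c(8.8, 9.5, NA, 9.9, 9.9, 10.1),
  quality = c(6L, 6L, 6L, 6L, NA, 6L)
)

dim(wine_toy)    # number of rows and columns
str(wine_toy)    # type and first values of each variable
head(wine_toy)   # first rows of the data

# Drop observations with any NA, then drop the unwanted index column.
clean <- na.omit(wine_toy)
clean <- subset(clean, select = -X)

summary(clean$alcohol)                           # five-number summary and mean
quantile(clean$alcohol, probs = c(0.25, 0.75))   # interquartile range bounds
```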

The challenging area was the multivariate section. I found it challenging to determine the structure of the plots I needed to build, which variables to include, and how to build an inference from each plot. Much time was devoted to gaining valuable insights and inter-relating the variables. This difficult process was made easier by examining the variables of interest obtained in the bivariate section and focusing on them in the multivariate analysis to draw significant conclusions.

I was able to determine significant relationships through effective visualization and to confirm them through the correlations observed between the variables. It was a fun and engaging process to analyze the same set of variables under different circumstances and draw new, significant information at every stage.

During the data preprocessing stage, I had only preliminary thoughts about the significant variables I would come across later in the analysis. I found it surprising at first that alcohol was a more informative and significant variable than pH. Later, after multiple analyses, I gained more knowledge and insight into why the identified significant variables play a vital role in the quality of white wine.

I would like to dig deeper into the winemaking process and see if I can include additional variables that I find interesting in my analysis. This would help overcome the limitations of the model building process and generate a good predictive model for this dataset. I would also like to spend more time on model re-specification by transforming the variables and, once the model is finalised through regression analysis, run model validation tests before recommending the model to any user.
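As a rough sketch of what such re-specification and validation could look like, the example below compares a raw and a log-transformed predictor on a held-out split. Everything here is an assumption for illustration: the data are simulated, the log transform of residual sugar is just one candidate, and `rmse` is a helper defined only for this example.

```r
# Synthetic data (assumption): quality depends on log(residual sugar).
set.seed(11)
n <- 600
wine_sim <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = rexp(n, rate = 1 / 6) + 0.6
)
wine_sim$quality <- 2 + 0.3 * wine_sim$alcohol +
  0.4 * log(wine_sim$residual.sugar) + rnorm(n, sd = 0.7)

# 80/20 split into training and hold-out sets.
train_idx <- sample(n, size = round(0.8 * n))
train     <- wine_sim[train_idx, ]
holdout   <- wine_sim[-train_idx, ]

# Two candidate specifications fitted on the training set only.
fit_raw <- lm(quality ~ alcohol + residual.sugar, data = train)
fit_log <- lm(quality ~ alcohol + log(residual.sugar), data = train)

# Hypothetical helper: root mean squared error on unseen data.
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$quality - predict(fit, newdata))^2))
}

c(raw = rmse(fit_raw, holdout), log = rmse(fit_log, holdout))
```

Comparing errors on data the model has not seen gives a fairer picture of predictive ability than the R-squared of the training fit.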